Regression trees example 1

Author

Murray Logan

Published

05/07/2025

1 Preparations

Load the necessary libraries

library(gbm) # for gradient boosted models
library(car)
library(dismo)
library(pdp)
library(ggfortify)
library(randomForest)
library(tidyverse)
library(gridExtra)
library(patchwork)
library(easystats)

2 Scenario

Leathwick et al. (2008) compiled capture data of short-finned eels (_Anguilla australis) within New Zealand freshwater streams to explore the distribution of the eels for conservation purposes. The goal was to be able to model the presence/absence of the eels against a range of environmental characteristics so as to predict their more broader occurances and identify which predictors are the most important in the predictions.

eel

Format of leathwick.csv data file

Site Angaus SegSumT SegTSeas SegLowFlow
1 0 16.0 -0.10 1.036
2 1 18.7 1.51 1.003
3 0 18.3 0.37 1.001
4 0 16.7 -3.80 1.000
5 1 17.2 0.33 1.005
6 0 15.1 1.83 1.015
.. .. .. .. ..
Site Unique label for each site.
Angaus Presence (1) or absence (0) of Anguilla australis eels
SegSumT Summer air temperature (degrees celcius) at the river segment
scale
SegTSeas Winter temperature normalised to January temperature at the river
segment scale
SegLowFlow Forth root transformed low flow rate at the river segment scale
DSDist Distance to coast (km) (a downstream predictor)
DSDam Presence of known downsteam obstructions (a downstream predictor)
DSMaxSlope Maximum downstream slope (a downstream predictor)
USAvgT Upstream average tempeture (normalised for the river segment)
USRainDays Number of rainy days recorded in the upstream catchment
USSlope Slope of the river upstream
USNative Percentage of the upstream riparian vegetation that is native
Method Method used to capture the eels (categorical predictor)
LocSed Weighted average of the proportional cover of bed sediment
(1=mud, 2=sand, 3=fine gravel, 4=course gravel, 5=cobble, 6=boulder, 7=bedrock)

3 Read in the data

leathwick <- read_csv("../data/leathwick.csv", trim_ws = TRUE)
Rows: 1000 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Method
dbl (13): Site, Angaus, SegSumT, SegTSeas, SegLowFlow, DSDist, DSMaxSlope, U...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(leathwick)
Rows: 1,000
Columns: 14
$ Site       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Angaus     <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,…
$ SegSumT    <dbl> 16.0, 18.7, 18.3, 16.7, 17.2, 15.1, 12.7, 18.1, 18.9, 18.2,…
$ SegTSeas   <dbl> -0.10, 1.51, 0.37, -3.80, 0.33, 1.83, 2.17, 1.00, 1.59, 0.7…
$ SegLowFlow <dbl> 1.036, 1.003, 1.001, 1.000, 1.005, 1.015, 1.001, 1.002, 1.0…
$ DSDist     <dbl> 50.2000, 132.5300, 107.4400, 166.8200, 3.9500, 11.1700, 42.…
$ DSMaxSlope <dbl> 0.57, 1.15, 0.57, 1.72, 1.15, 1.72, 2.86, 2.29, 0.40, 3.43,…
$ USAvgT     <dbl> 0.09, 0.20, 0.49, 0.90, -1.20, -0.20, 1.45, 0.47, 0.25, 0.0…
$ USRainDays <dbl> 2.470, 1.153, 0.847, 0.210, 1.980, 3.300, 0.430, 1.153, 0.8…
$ USSlope    <dbl> 9.8, 8.3, 0.4, 0.4, 21.9, 25.7, 9.6, 4.9, 9.8, 20.5, 3.9, 6…
$ USNative   <dbl> 0.81, 0.34, 0.00, 0.22, 0.96, 1.00, 0.09, 0.02, 0.74, 0.92,…
$ DSDam      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
$ Method     <chr> "electric", "electric", "spo", "electric", "electric", "ele…
$ LocSed     <dbl> 4.8, 2.0, 1.0, 4.0, 4.7, 4.5, 4.3, NA, NA, 3.6, 3.7, 1.0, 3…
leathwick_test <- read_csv("../data/leathwick_test.csv", trim_ws = TRUE)
Rows: 500 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (12): Angaus_obs, SegSumT, SegTSeas, SegLowFlow, DSDist, DSMaxSlope, USA...
lgl  (1): Method

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(leathwick_test)
Rows: 500
Columns: 13
$ Angaus_obs <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
$ SegSumT    <dbl> 16.6, 16.8, 16.3, 15.6, 14.6, 16.7, 18.3, 16.4, 17.2, 15.7,…
$ SegTSeas   <dbl> 1.01, -0.51, 0.76, 1.56, -0.20, 0.80, 0.67, 1.44, -0.67, -0…
$ SegLowFlow <dbl> 1.017, 1.002, 1.023, 1.003, 1.023, 1.003, 1.005, 1.031, 1.0…
$ DSDist     <dbl> 5.2300, 2.2400, 162.2800, 4.0500, 127.0300, 2.4800, 94.0770…
$ DSMaxSlope <dbl> 0.29, 0.00, 5.14, 0.57, 1.72, 14.57, 0.57, 1.72, 1.72, 2.86…
$ USAvgT     <dbl> -1.40, 0.27, -0.60, 1.14, -1.90, -1.50, 0.54, 0.75, -1.80, …
$ USRainDays <dbl> 1.980, 0.460, 0.806, 3.300, 1.940, 1.980, 0.847, 2.249, 0.5…
$ USSlope    <dbl> 10.0, 0.7, 21.4, 0.9, 28.9, 22.0, 1.0, 4.8, 26.4, 26.1, 23.…
$ USNative   <dbl> 1.00, 0.00, 0.66, 0.75, 0.97, 0.99, 0.01, 0.02, 0.74, 0.99,…
$ DSDam      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,…
$ Method     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LocSed     <dbl> 4.9, 2.3, 4.3, 1.0, NA, 2.8, NA, NA, 4.2, 4.1, 4.3, 1.2, NA…

4 Exploratory data analysis

scatterplotMatrix(
  ~ Angaus + SegSumT + SegTSeas + SegLowFlow + DSDist + DSMaxSlope + DSDam +
    USAvgT + USRainDays + USSlope + USNative + Method + LocSed,
  data = leathwick,
  diagonal = list(method = "boxplot")
)
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth
Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
FALSE, : could not fit smooth

5 Fit the model

In this case, we already have the training and tests sets - there is no need to partition out the data. Nevertheless, it is still worth setting the random seed to ensure repeatibility.

set.seed(123)

leathwick.gbm <- gbm(
  Angaus ~ SegSumT + SegTSeas + SegLowFlow + DSDist + DSMaxSlope + DSDam +
    USAvgT + USRainDays + USSlope + USNative + Method + LocSed,
  data = leathwick,
  distribution = "bernoulli",
  var.monotone = c(1, 1, 0, -1, -1, 0, 1, -1, -1, -1, 0, -1),
  n.trees = 10000,
  interaction.depth = 5,
  bag.fraction = 0.5,
  shrinkage = 0.001,
  train.fraction = 1,
  cv.folds = 3
)
(best.iter <- gbm.perf(leathwick.gbm, method = "OOB"))
OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.

[1] 2587
attr(,"smoother")
Call:
loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
    length(x)/10), 50))

Number of Observations: 10000 
Equivalent Number of Parameters: 39.99 
Residual Standard Error: 9.253e-06 
(best.iter <- gbm.perf(leathwick.gbm, method = "cv"))

[1] 3514
summary(leathwick.gbm, n.trees = best.iter)

                  var     rel.inf
SegSumT       SegSumT 28.63496756
Method         Method 14.45068581
USNative     USNative  9.90559581
DSMaxSlope DSMaxSlope  8.58539604
USRainDays USRainDays  7.71419693
LocSed         LocSed  5.91417507
USSlope       USSlope  5.72775409
SegLowFlow SegLowFlow  5.25905437
USAvgT         USAvgT  4.96370837
SegTSeas     SegTSeas  4.52081951
DSDist         DSDist  4.22995178
DSDam           DSDam  0.09369467

6 Partial plots

leathwick.gbm %>%
  pdp::partial(
    pred.var = "SegSumT",
    n.trees = best.iter,
    inv.link = plogis,
    recursive = FALSE,
    type = "regression"
  ) %>%
  autoplot() +
  ylim(0, 1)

nms <- colnames(leathwick)
p <- vector("list", 12)
names(p) <- nms[3:14]
for (nm in nms[3:14]) {
  print(nm)
  p[[nm]] <- leathwick.gbm |>
    pdp::partial(
      pred.var = nm,
      n.trees = best.iter,
      inv.link = plogis,
      recursive = FALSE,
      type = "regression"
    ) |>
    autoplot() + ylim(0, 1)
}
[1] "SegSumT"
[1] "SegTSeas"
[1] "SegLowFlow"
[1] "DSDist"
[1] "DSMaxSlope"
[1] "USAvgT"
[1] "USRainDays"
[1] "USSlope"
[1] "USNative"
[1] "DSDam"
[1] "Method"
[1] "LocSed"
do.call("grid.arrange", p)

7 Assessing accuracy

ROC curve: receiver operating characteristic curve performance of a classication model at all thresholds. It plots two components: y-axis: True Positive rate (TP/(TP+FN)) x-axis: False Positive rate (FP/(FP+TN)) AUC: Area under ROC curve an aggregate measure of the performance under all thresolds ranges from 0 (0% correct) to 1 (100% correct). max TPR+TNR at: the threshold. Values above this should be considered 1, otherwise 0 (when classifying)

leathwick_test %>%
  bind_cols(Pred = predict(leathwick.gbm,
    newdata = leathwick_test,
    n.tree = best.iter, type = "response"
  )) %>%
  ggplot() +
  geom_boxplot(aes(y = Pred, x = as.factor(Angaus_obs))) +
  geom_point(aes(y = Pred, x = as.factor(Angaus_obs)), position = position_jitter(width = 0.05))

# preds <- predict.gbm(leathwick.gbm, newdata=leathwick_test,
#                     n.trees=best.iter,  type='response')
preds <- leathwick_test |>
  bind_cols(Pred = predict(leathwick.gbm,
    newdata = leathwick_test,
    n.tree = best.iter, type = "response"
  ))
pres <- preds |>
  filter(Angaus_obs == 1) |>
  pull(Pred)
abs <- preds |>
  filter(Angaus_obs == 0) |>
  pull(Pred)
e <- dismo::evaluate(p = pres, a = abs)
e
class          : ModelEvaluation 
n presences    : 107 
n absences     : 393 
AUC            : 0.8583387 
cor            : 0.5255761 
max TPR+TNR at : 0.1246326 

7.1 Plot spatial distribution of eels

data(Anguilla_grids)
leathwick.grid <- Anguilla_grids
glimpse(leathwick.grid)
Formal class 'RasterBrick' [package "raster"] with 13 slots
Warning: Not a validObject(): no slot of name "srs" for this object of class
"RasterBrick"
  ..@ history   : list()
  ..@ file      :Formal class '.RasterFile' [package "raster"] with 13 slots
Warning: Not a validObject(): no slot of name "NAchanged" for this object of
class ".RasterFile"
  ..@ data      :Formal class '.MultipleRasterData' [package "raster"] with 14 slots
  ..@ legend    :Formal class '.RasterLegend' [package "raster"] with 5 slots
  ..@ title     : chr(0) 
  ..@ extent    :Formal class 'Extent' [package "raster"] with 4 slots
  ..@ rotated   : logi FALSE
  ..@ rotation  :Formal class '.Rotation' [package "raster"] with 2 slots
  ..@ ncols     : int 250
  ..@ nrows     : int 196
  ..@ crs       :Formal class 'CRS' [package "sp"] with 1 slot
  ..@ z         : list()
  ..@ layernames: chr [1:11] "DSDam" "DSDist" "DSMaxSlope" "LocSed" ...
  ..@ NA        : NULL
  ..$ layernames: chr [1:11] "DSDam" "DSDist" "DSMaxSlope" "LocSed" ...
plot(leathwick.grid)

Method <- factor("electric", levels = levels(leathwick$Method))
Method <- as.data.frame(Method)

fit <- predict(leathwick.grid, leathwick.gbm,
  const = Method,
  n.trees = best.iter, type = "response"
)
# fit <- mask(fit,  raster(leathwick.grid, 1))
library(stars)
Loading required package: abind
Loading required package: sf
Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
fit <- stars::st_as_stars(fit)


ggplot() +
  geom_stars(data = fit) +
  scale_fill_gradient(low = "red", high = "green", "Probability\nof occurrance", na.value = NA) +
  coord_sf(expand = FALSE) +
  theme_bw()
Warning: Removed 40942 rows containing missing values or values outside the scale range
(`geom_raster()`).

8 References

Leathwick, J. R., J. Elith, W. L. Chadderton, D. Rowe, and T. Hastie. 2008. “Dispersal, Disturbance and the Contrasting Biogeographies of New Zealand’s Diadromous and Non-Diadromous Fish Species.” Journal of Biogeography 35 (8): 1481–97. https://doi.org/https://doi.org/10.1111/j.1365-2699.2008.01887.x.